Mining Programming Language Vocabularies from Source Code

Authors

Daniel P. Delorey

Charles D. Knutson

Mark Davies

Abstract

We can learn much from the artifacts produced as the by-products of software development and stored in software repositories. Of all such potential data sources, one of the most important from the perspective of program comprehension is the source code itself. While other data sources give insight into what developers intend a program to do, the source code is the most accurate human-accessible description of what it will do. However, the ability of an individual developer to comprehend a particular source file depends directly on his or her familiarity with the specific features of the programming language used in the file. This is not unlike the difficulties second-language learners may encounter when attempting to read a text written in a new language. We propose that by applying the techniques used by corpus linguists in the study of natural language texts to a corpus of programming language texts (i.e., source code repositories), we can gain new insights into the communication medium that is programming language. In this paper we lay the foundation for applying corpus linguistic methods to programming language by (1) defining the term “word” for programming language, (2) developing data collection tools and a data storage schema for the Java programming language, and (3) presenting an initial analysis of an example linguistic corpus based on version 1.5 of the Java Development Kit (JDK).
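
As a rough illustration of what counting programming-language "words" might look like, the sketch below treats every lexical token in a Java-like source string as one word and tallies frequencies. This is an assumption made only for illustration: the abstract does not state the paper's actual definition of "word", and the sketch is written in Python rather than with the authors' tools. The names `TOKEN_RE` and `count_words` are hypothetical, and the regular expression is a simplified stand-in for a real Java lexer.

```python
import re
from collections import Counter

# Simplified, hypothetical token pattern for Java-like source.
# It is NOT a complete Java lexer (no text blocks, hex literals, etc.).
TOKEN_RE = re.compile(r"""
    //[^\n]*                    # line comment
  | /\*.*?\*/                   # block comment
  | "(?:\\.|[^"\\])*"           # string literal
  | '(?:\\.|[^'\\])*'           # character literal
  | [A-Za-z_$][A-Za-z0-9_$]*    # identifier or keyword
  | \d+(?:\.\d+)?               # simple numeric literal
  | [{}()\[\];,.]               # separators
  | [=+\-*/<>!&|^%~?:]+         # run of operator characters
""", re.VERBOSE | re.DOTALL)


def count_words(source: str) -> Counter:
    """Count token occurrences in a source string, ignoring comments.

    Assumption: one lexical token equals one "word".
    """
    counts = Counter()
    for match in TOKEN_RE.finditer(source):
        token = match.group(0)
        # Comments are matched so they can be skipped, not counted.
        if token.startswith("//") or token.startswith("/*"):
            continue
        counts[token] += 1
    return counts


if __name__ == "__main__":
    sample = """
    public class Hello {
        // greet the user
        public static void main(String[] args) {
            int x = 1;
            x = x + 1;
            System.out.println("hi");
        }
    }
    """
    # Print the five most frequent "words" in the sample.
    for word, freq in count_words(sample).most_common(5):
        print(word, freq)
```

A corpus-scale study would more plausibly rely on a real parser and a database schema, as the abstract indicates; the sketch only shows the basic counting step under the stated assumption.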

Similar resources

A Lightweight Approach to Uncover Technical Information in Unstructured Data

Developer communication through email, chat, or issue report comments consists mostly of largely unstructured data, i.e., natural language text, mixed with technical information such as project-specific jargon, abbreviations, source code patches, stack traces and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide rang...

XML and the art of code maintenance

We present three distinct XML vocabularies and demonstrate integration with o:XML source code. The vocabularies relate to interface documentation, unit tests and Design By Contract conditions. By layering the information with XML namespaces, syntax conflicts are avoided and selective processing can be performed using standard XML tools. Automated unit tests and the implementation of Design By C...

Declarative Visitors to Ease Fine-grained Source Code Mining with Full History on Billions of AST Nodes by Robert Dyer, Hridesh Rajan, and Tien N. Nguyen

Software repositories contain a vast wealth of information about software development. Mining these repositories has proven useful for detecting patterns in software development, testing hypotheses for new software engineering approaches, etc. Specifically, mining source code has yielded significant insights into software development artifacts and processes. Unfortunately, mining source code at...

MedlineR: an open source library in R for Medline literature data mining

SUMMARY We describe an open source library written in the R programming language for Medline literature data mining. This MedlineR library includes programs to query Medline through the NCBI PubMed database; to construct the co-occurrence matrix; and to visualize the network topology of query terms. The open source nature of this library allows users to extend it freely in the statistical progr...

Template Mining in Source-code Digital Libraries

As a greater number of software developers make their source code available, there is a need to store such open-source applications in a repository and to facilitate search over that repository. The objective of this research is to build a digital library of Java source code, to enable search and selection of source code. We believe that such a digital library will enable better sharing of experi...

Publication date: 2009
